Large language models (LLMs) have been increasingly applied in Automatic Speech Recognition (ASR), achieving significant advancements. However, the performance of LLM-based ASR (LLM-ASR) models remains unsatisfactory when applied across domains due to domain shifts between acoustic and linguistic conditions. To address this challenge, we propose a decoupled two-stage domain adaptation framework that separates the adaptation process into text-only and audio-only stages. In the first stage, we leverage abundant text data from the target domain to refine the LLM component, thereby improving its contextual and linguistic alignment with the target domain. In the second stage, we employ a pseudo-labeling method with unlabeled audio data in the target domain and introduce two key enhancements: (1) incorporating decoupled auxiliary Connectionist Temporal Classification (CTC) loss to improve the robustness of the speech encoder under different acoustic conditions; (2) adopting a synchronous LLM tuning strategy, allowing the LLM to continuously learn linguistic alignment from pseudo-labeled transcriptions enriched with domain textual knowledge. The experimental results demonstrate that our proposed methods significantly improve the performance of LLM-ASR in the target domain, achieving a relative word error rate reduction of 19.2%.
Loading....